Abstract

This paper develops new insights into quantitative methods for the validation of computational
model prediction. Four types of methods are investigated, namely classical and
Bayesian hypothesis testing, a reliability-based method, and an area metric-based method. 
Traditional Bayesian hypothesis testing is extended based on interval hypotheses on distri- 
bution parameters and equality hypotheses on probability distributions, in order to validate 
models with deterministic/stochastic output for given inputs. Formulations and imple- 
mentation details are outlined for both equality and interval hypotheses. Two types of 
validation experiments are considered - fully characterized (all the model/experimental 
inputs are measured and reported as point values) and partially characterized (some of 
the model/experimental inputs are not measured or are reported as intervals). Bayesian 
hypothesis testing can minimize the risk in model selection by properly choosing the model 
acceptance threshold, and its results can be used in model averaging to avoid Type I/II
errors. It is shown that Bayesian interval hypothesis testing, the reliability-based method, 
and the area metric-based method can account for the existence of directional bias, where 
the mean predictions of a numerical model may be consistently below or above the corre- 
sponding experimental observations. It is also found that under some specific conditions, the 
Bayes factor metric in Bayesian equality hypothesis testing and the reliability-based metric 
can both be mathematically related to the p-value metric in classical hypothesis testing. 
Numerical studies are conducted to apply the above validation methods to gas damping 
prediction for radio frequency (RF) microelectromechanical system (MEMS) switches. The 
model of interest is a general polynomial chaos (gPC) surrogate model constructed based on 
expensive runs of a physics-based simulation model, and validation data are collected from
fully characterized experiments.

Keywords: model validation, hypothesis testing, Bayesian statistics, reliability, MEMS 



1. Introduction 

Model validation is defined as the process of determining the degree to which a model is 
an accurate representation of the real world from the perspective of the intended use of the 
model [1], [2]. Qualitative validation methods such as graphical comparisons between model
predictions and experimental data are widely used in engineering. However, statistics-based
quantitative methods are needed to supplement subjective judgments and to systematically
account for errors and uncertainty in both model prediction and experimental observation [3].

Previous research efforts include the application of statistical hypothesis testing methods 
in the context of model validation [4]-[6], and development of validation metrics as measures of
agreement between model prediction and experimental observation [7]-[11]. Some discussions
on the pros and cons of these validation methods can be found in [7], [12]. Based on these
existing methods and the related studies, this paper is motivated by several issues which 
remain unclear in the practice of model validation: (1) validation with fully characterized 
vs. partially characterized experimental data; (2) validation of deterministic vs. stochastic 
model predictions; (3) accounting for the existence of directional bias; and (4) choice of 
thresholds in different validation metrics. 

First, there are two possible types of validation data, resulting from (1) fully characterized 
experiments (i.e., all the inputs of the model/experiment are measured and reported as point 
values) or (2) partially characterized experiments (i.e., some inputs of the model/experiment 
are not measured or are reported as intervals). For instance, some input variables of the 
model/experiment may not be measured, but we may have expert opinions about the possible 
ranges or probability distributions of these input variables, and thus this experiment is
"partially" characterized. In other words, there will be more uncertainty in the data from
partially characterized experiments than from fully characterized experiments, due to the
uncertainty in the input variables. Some partially characterized experiments with limited 
uncertainty may be considered for validation by practitioners. The term "input" refers
to the variables in a model that can be measured in experiments. We assume that the
same set of variables goes into the model and validation experiments as inputs, and we are
comparing the outputs of the model and experiments during validation. Therefore, the terms
"model inputs" and "experimental inputs" mean the same thing in this paper. When a
model is developed, the physical quantity Y is postulated to be a function of a set of variables
$\{x, \theta\}$. This function is not exactly known and hence is approximated using a model with
output Ym. Y is observable through some experiments, and x are the measurable input
variables to the experiments. Note that $\theta$ are the variables that cannot be measured in the
experiments and are called "model parameters". A simple example of a measurable
experimental input is the amplitude of loading applied on a cantilever beam, while the
deflection of the beam is the measured quantity. Also note that the diagnostic quality and the
bias in experiments are not considered as "input". Instead, they are classified as components
of the measurement uncertainty, which is represented by $\epsilon_D$ in this paper. While most of
the previous studies only focus on validation with fully characterized experimental data, this 
paper explores the use of both types of data in various validation methods. 

Second, due to the existence of aleatory and epistemic uncertainty, both the model 
prediction (denoted as Ym) and the physical quantity to be predicted (denoted as Y) can be 
uncertain, and this has been the dominant case studied in the literature [5]-[7], [10], [11], [13].
However, in practice it is possible that either Ym or Y can be considered as deterministic.
Note that a deterministic Ym means that, for given values of the model input variables,
the output prediction of the model is deterministic. The application of various validation 
methods to these different cases will be covered in this paper. 

Third, in this study, we define two terms to characterize the difference between model
prediction and validation data: bias and directional bias. Bias is defined as the difference
between the mean value of model predictions and the statistical mean value of experimental
observations, and the term "directional bias" means that the direction of bias remains
unchanged as one varies the inputs of model and experiment. This paper explores various
validation methods in order to account for the existence of the directional bias.

Fourth, although different validation metrics are usually developed to measure the agree- 
ment between model prediction and validation data from different perspectives, this paper 
shows that under certain conditions some of the validation metrics can be mathematically 
related. These relationships may help decision makers to select appropriate validation metrics 
and the corresponding model acceptance/rejection thresholds. 

Various quantitative validation metrics, including the p-value in classical hypothesis
testing [14], the Bayes factor in Bayesian hypothesis testing [15], a reliability-based metric [7],
and an area-based metric [10], [11], are investigated in this paper. Based on the original
definition of Bayes factor, we formulate two types of Bayesian hypothesis testing, one on the 
accuracy of predicted mean and standard deviation of model prediction, and the other one on 
the entire predicted probability distribution of the model prediction. These two formulations 
of Bayesian hypothesis testing can be applied to both fully characterized and partially 
characterized experiments. The use of these two types of experimental data in the other 
validation methods is also investigated. The first formulation of Bayesian hypothesis testing, 
along with the modified reliability-based method and the area metric-based method, takes into 
account the existence of directional bias. The mathematical relationships among the metrics 
used in classical hypothesis testing, Bayesian hypothesis testing, and the reliability-based 
method are investigated. 

Section 2 presents the general procedure of quantitative model validation in the presence
of uncertainty. Sections 3 and 4 investigate the aforementioned model validation methods for
(1) fully characterized and partially characterized experimental data, (2) application to the
case when model prediction and the quantity to be predicted may or may not be uncertain,
(3) sensitivity to the existence of the directional bias, and (4) the mathematical relationships
among some of these validation methods. A numerical example is presented in Section 5 to
illustrate the validation of a MEMS switch damping model, which is a generalized polynomial
chaos (gPC) surrogate model [16] that has been constructed to predict the squeeze-film
damping coefficient. The gPC model is used to replace the expensive micro-scale fluid
simulation model and thus expedite the probabilistic analysis of the MEMS device.

2. Quantitative validation of model prediction 

Suppose a computational model is constructed to predict an unknown physical quantity. 
Quantitative model validation methods involve the comparison between model prediction 
and experimental observation. In this paper, we use the following notation:

- Y represents the "true value" of the system response

- Ym is the model prediction of this true response Y

- Yd is the experimental observation of Y

The development of validation metrics is usually based on assumptions on Y, Ym, and
Yd, and these assumptions relate to the various sources of uncertainty and the types of
available validation data. In order to select appropriate validation methods, the first step is
to identify the sources of uncertainty and the type of validation data.

As mentioned earlier, the available validation data can be from fully characterized or
partially characterized experiments. In the case of fully characterized experiments, the
model/experimental inputs x are measured and reported as point values. The true value of
the physical quantity (Y) and the output of the model (Ym) corresponding to these measured
values of x will be deterministic if there are no other uncertainty sources existing in the
physical system and the model. Note that Y and Ym can still be stochastic because of
uncertainty sources other than the input uncertainty. For example, the Young's modulus of
a certain material can be stochastic due to variations in the material micro-structure, and
the output of a regression model for given inputs is stochastic because of the random residual
term. If the experiment is partially characterized, some of the inputs x are not measured
or are reported as intervals, and thus the uncertainty in x should be considered. In the
Bayesian approach, the lack of knowledge (epistemic uncertainty) about x is represented
through a probability distribution (subjective probability). Then, since both Y and Ym
are considered as functions of x, they are also treated through probability distributions.
Non-probabilistic approaches have also been proposed to handle the epistemic uncertainty;
in this paper, we only focus on probabilistic methods.

Note that Yd results from the addition of measurement uncertainty to the true value of
the physical quantity Y, i.e., $Y_D = Y + \epsilon_D$, where $\epsilon_D$ represents measurement uncertainty.
Hence, the uncertainty in the experimental observation (Yd) can be split into two parts,
the uncertainty in the physical system response (Y) and the measurement uncertainty in
experiments ($\epsilon_D$). It should be noted that experimental data with poor quality can hardly
provide any useful information on the validity of a model. The discussions in this paper
are restricted to the cases where uncertainty in data (due to the uncertainty in measuring
experimental input and output variables) is limited.

Table 1 summarizes the applicability of the various validation methods investigated in
this paper to the different scenarios discussed above, and more details will be presented in
Sections 3 and 4.



Table 1: Scenarios of validation and the corresponding methods

Experimental data        Quantity Y          Prediction Ym    Applicable
                         (to be predicted)   (from model)     methods
-----------------------  ------------------  ---------------  ----------
Fully characterized      Stochastic          Deterministic    1,2,4
Fully characterized      Deterministic       Stochastic       1,2,4,5
Fully characterized      Stochastic          Stochastic       1,2,3,4,5
Partially characterized  Stochastic          Stochastic       1,2,3,4,5




Methods considered: 

1. Classical hypothesis testing 

2. Bayesian interval hypothesis testing 

3. Bayesian equality hypothesis testing 

4. Reliability-based method 

5. Area metric-based method 

Note: Yd is always treated as a random variable due to measurement uncertainty 

After selecting a validation method and computing the corresponding metric, another 
important aspect of model validation is to decide if one should accept or reject the model 
prediction based on the computed metric and the selected threshold. Sections 3 and 4 will
provide some discussions on the decision threshold. The flowchart in Fig. 1 describes a
systematic procedure for quantitative model validation.

[Figure 1 (flowchart); recoverable box labels: "Model prediction: stochastic or deterministic", "Select validation method and calculate metric"]

Figure 1: Decision process in quantitative model validation

3. Hypothesis testing-based methods 

Statistical binary hypothesis testing involves deciding between the plausibility of two
hypotheses: the null hypothesis (denoted as H0) and the alternative hypothesis (denoted
as H1). H0 is usually something that one believes could be true, whereas H1 may be a
rival of H0 [17]. For example, given a model for damping coefficient prediction, H0 can be
the hypothesis that the model prediction is equal to the actual damping coefficient, and
correspondingly H1 states that the model prediction is not equal to the actual damping
coefficient. The null hypothesis H0 will be rejected if it fails the test, and will not be rejected
if it passes the test. Two types of error can possibly occur from this exercise: rejecting the
correct hypothesis (type I error), or failing to reject the incorrect hypothesis (type II error).
In the context of model validation, it should be noted that the underlying subject matter
domain knowledge is also necessary for the implementation of the hypothesis testing-based
methods, especially in the formulation of test hypotheses (H0 and H1) and the selection
of model acceptance threshold. To formulate appropriate H0 and H1 for the validation
of a model with stochastic output prediction Ym, we need to be clear about the physical
interpretation of "model being correct". In other words, we need to decide whether or not
the accurate prediction of certain order moments or the entire PDF of Ym suggests that the
model is correct.

3.1. Classical hypothesis testing

Classical hypothesis testing is well established and has been explained in detail in many 
statistics textbooks. A brief overview is given here, only to facilitate the development of 
mathematical relationships between various validation methods in later sections. 

In classical hypothesis testing, a test statistic is first formulated and the probability 
distributions of this statistic under the null and alternative hypothesis are derived theoretically 
or by approximations. Thereafter, one can compute the value of the test statistic based on 
validation data and thus calculate the corresponding p-value, which is the probability that 
the test statistic falls outside a range defined by the computed value of the test statistic 
under the null hypothesis. The p-value can be considered as an indicator of how good the null 
hypothesis is, since a better H0 corresponds to a narrower range defined by the computed
value of the test statistic and thus a higher probability of the test statistic falling outside 
the range. 

The practical outcome of model validation should be to provide useful information for 
decision making in terms of model selection. The decisions whether or not to reject the null 
hypothesis can be made based on the acceptable probabilities of making type I and type 
II errors (specified by the decision maker). The concept of significance level $\alpha$ defines the
maximum probability of making type I error, and the probability of making type II error $\beta$
can be estimated based on $\alpha$ and the probability distribution of the test statistic under H1.
The null hypothesis will be rejected if the computed p-value is less than $\alpha$, or the computed
$\beta$ exceeds the acceptable value. Correspondingly, we will reject the model if H0 is rejected,
and accept the model if H0 is not rejected. An alternative approach to comparing the p-value
and $\alpha$ is to use confidence intervals. A confidence interval can be constructed based on the
confidence level $\gamma = 1 - \alpha$, and the null hypothesis will be rejected if the confidence interval
does not include the predicted value from the model.

It should be noted that accepting the model (or failing to reject H0) indicates that the
accuracy of the model is acceptable, but it does not prove that the model (or H0) is true.
Also note that the comparison between the p-value and the significance level $\alpha$ becomes meaningless
when the sample size of experimental data is large. Since almost no null hypothesis H0
is exactly true, the p-value will decrease as the sample size increases, and thus H0 will tend to
be rejected at a given significance level $\alpha$ as the sample size grows large [18]. In addition,
the over-interpretation of p-values and the corresponding significance testing results can be
misleading and dangerous for model validation. Criticisms on over-stressing p-values and
significance levels can be found in [18], [19].

Various test statistics have been developed corresponding to different types of hypotheses. 
The t-test or z-test can be used to test the hypothesis that the mean of a normal random 
variable is equal to the model prediction; the chi-square test can be used to test the hypothesis 
that the variance of a normal random variable is equal to the model prediction; and the 
hypothesis that the observed data come from a specific probability distribution can be 
tested using methods such as the chi-square test, the Kolmogorov-Smirnov (K-S) test, the 
Anderson-Darling test and the Cramer test [20]. The tests on the variance or the probability
distribution require relatively large amounts of validation data and thus only the tests on 
the distribution mean are discussed in this subsection, namely the t-test and the z-test. 

The t-test is based on Student's t-distribution. Suppose the quantity to be predicted Y
is a normal random variable with mean $\mu$ and standard deviation $\sigma$, and the measurement
noise $\epsilon_D$ is also a normal random variable with zero mean and standard deviation $\sigma_D$. Thus,
the experimental observation $Y_D = Y + \epsilon_D \sim N(\mu, \sigma^2 + \sigma_D^2)$. For the sake of simplicity, let
$\sigma_{Y_D} = \sqrt{\sigma^2 + \sigma_D^2}$. The validation data is a set of random samples of Yd with size n (i.e., $y_{D1}$,
$y_{D2}$, ..., $y_{Dn}$) and the corresponding sample mean is $\bar{Y}_D$ and sample standard deviation is $S_D$.
The variable $(\bar{Y}_D - \mu)/(S_D/\sqrt{n})$ is modeled with a t-distribution with $(n - 1)$ degrees of
freedom. Therefore, if one assumes that the model mean prediction $\mu_m$ (if the model prediction
is deterministic, $\mu_m$ equals the prediction value) is the mean of Y, i.e., the null hypothesis
is $H_0: \mu = \mu_m$, then the corresponding test statistic t (which follows the same t-distribution) is

$$t = \frac{\bar{Y}_D - \mu_m}{S_D/\sqrt{n}} \tag{1}$$

The p-value for the two-tailed t-test (considering both ends of the distribution) can be
obtained as

$$p = 2F_{T,n-1}(-|t|) \tag{2}$$

where $F_{T,n-1}$ is the cumulative distribution function (CDF) of a t-distribution with $(n - 1)$
degrees of freedom. If the chosen significance level is $\alpha$, then one will reject the null hypothesis
H0 if $p < \alpha$, or fail to reject H0 if $p > \alpha$.

The t-test requires a sample size $n \geq 2$ in order to estimate the sample standard deviation
$S_D$. If $n = 1$, the z-test can be used instead by assuming that the standard deviation of Y is
equal to the standard deviation of the model prediction Ym, i.e., $\sigma = \sigma_m$ and $\sigma_{Y_D} = \sqrt{\sigma_m^2 + \sigma_D^2}$.
Thus, the variable $(\bar{Y}_D - \mu)/(\sigma_{Y_D}/\sqrt{n})$ follows the standard normal distribution. Under the
null hypothesis $H_0: \mu = \mu_m$, the test statistic z is

$$z = \frac{\bar{Y}_D - \mu_m}{\sigma_{Y_D}/\sqrt{n}} \tag{3}$$

The corresponding p-value for the two-tailed z-test can be computed as

$$p = 2\Phi(-|z|) \tag{4}$$

where $\Phi$ is the CDF of the standard normal variable. Similar to the t-test, one will reject H0
if $p < \alpha$, or fail to reject H0 if $p > \alpha$.
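As a minimal sketch of the z-test in Eqs. 3 and 4, the following Python fragment computes the test statistic and the two-tailed p-value using only the standard library. The function name `z_test_p_value` and the numerical inputs are hypothetical and chosen purely for illustration; they are not taken from the application example of this paper.

```python
import math
from statistics import NormalDist


def z_test_p_value(y_obs, mu_m, sigma_m, sigma_d):
    """Two-tailed z-test for H0: mu = mu_m (illustrative sketch of Eqs. 3-4).

    y_obs   : list of observations of Y_D (n >= 1)
    mu_m    : mean of the model prediction
    sigma_m : standard deviation of the model prediction (0 if deterministic)
    sigma_d : standard deviation of the measurement noise
    """
    n = len(y_obs)
    y_bar = sum(y_obs) / n
    # Standard deviation of Y_D under the assumption sigma = sigma_m
    sigma_yd = math.sqrt(sigma_m**2 + sigma_d**2)
    z = (y_bar - mu_m) / (sigma_yd / math.sqrt(n))
    # Eq. 4: p = 2 * Phi(-|z|)
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical single observation (n = 1), where the t-test is not available
z, p = z_test_p_value([10.4], mu_m=10.0, sigma_m=0.3, sigma_d=0.4)
alpha = 0.05
print(f"z = {z:.3f}, p = {p:.3f}, reject H0: {p < alpha}")
```

With these assumed numbers the observation lies within one standard deviation of the predicted mean, so H0 is not rejected at the 5% level.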

To compute the probability of making type II error $\beta$, an alternative hypothesis H1
is needed, and a commonly seen formulation is $H_1: \mu = \mu_m + \epsilon_\mu$. In the t-test, under the
alternative hypothesis $H_1: \mu = \mu_m + \epsilon_\mu$, the t statistic follows a non-central t-distribution
with noncentrality parameter $\delta = \sqrt{n}\,\epsilon_\mu/\sigma_{Y_D}$ [21], [22]. The probability of making type II error
$\beta$ can then be estimated as

$$\beta = 1 - \Pr\left(|t| > t_{1-\alpha/2,\,n-1} \mid \delta\right) \tag{5}$$

where the term $\Pr(|t| > t_{1-\alpha/2,\,n-1} \mid \delta)$ is called the power of the test in rejecting H0. Similarly,
$\beta$ in the z-test can be estimated as

$$\beta = 1 - \Pr\left(|z - \delta| > \Phi^{-1}(1 - \alpha/2)\right) \tag{6}$$

Note that the above discussion is for the case that both Y and Ym are stochastic. If Y
is deterministic, the standard deviation $\sigma$ becomes zero; if Ym is deterministic, $\sigma_m$ becomes
zero. However, the computation procedure of the p-value remains the same.
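The type II error of the z-test in Eq. 6 can be evaluated in closed form from the standard normal CDF. The sketch below is one possible implementation under the assumption that z in Eq. 6 denotes a standard normal variable; the function name `z_test_type2_error` and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_type2_error(eps_mu, sigma_yd, n, alpha):
    """Type II error beta of the two-tailed z-test (illustrative sketch of Eq. 6).

    eps_mu   : assumed shift of the true mean under H1 (mu = mu_m + eps_mu)
    sigma_yd : standard deviation of Y_D
    n        : number of validation data points
    alpha    : significance level
    """
    std_normal = NormalDist()
    delta = math.sqrt(n) * eps_mu / sigma_yd
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Power = Pr(|z - delta| > z_crit) for a standard normal z
    #       = Pr(z < delta - z_crit) + Pr(z > delta + z_crit)
    power = std_normal.cdf(delta - z_crit) + std_normal.cdf(-delta - z_crit)
    return 1.0 - power


# Hypothetical values: beta shrinks as more validation data become available
for n in (1, 5, 20):
    beta = z_test_type2_error(eps_mu=0.5, sigma_yd=0.5, n=n, alpha=0.05)
    print(f"n = {n:2d}, beta = {beta:.3f}")
```

When the shift is zero the power reduces to $\alpha$, so $\beta = 1 - \alpha$, which gives a quick consistency check on the implementation.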

Applying classical hypothesis testing to fully characterized experiments is straightforward 
as one can directly compare the data with the model predictions for given inputs. For partially 
characterized experiments, some of the inputs of the model/experiments are available in the 
form of intervals or probability distributions based on measurements or expert opinions. Let 
data that have inputs with the same intervals or probability distributions form a sample 
set. The aforementioned t-test and z-test can then be conducted by comparing the mean of
the sample set with the mean of the unconditional probability distribution of model output
("unconditional" means that the probability distribution is not dependent on the point values
of inputs). The unconditional probability distribution of model output can be obtained by 
propagating uncertainty from the input variables to the output variable [23]. 

3.2. Bayesian hypothesis testing 

In probability theory, Bayes' theorem reveals the relation between two conditional
probabilities, e.g., the probability of occurrence of an event A given the occurrence of an
event E (denoted as $\Pr(A|E)$), and the probability of occurrence of the event E given the
occurrence of the event A (denoted as $\Pr(E|A)$). This relation can be written as [24]

$$\Pr(E|A) = \frac{\Pr(A|E)\,\Pr(E)}{\Pr(A)} \tag{7}$$

Suppose event A is the observation of a single validation data point $y_D$, and event E
can be either the hypothesis H0 is true or the hypothesis H1 is true. Using Bayes' theorem,
we can calculate the ratio between the posterior probabilities of the two hypotheses given
validation data $y_D$ as

$$\frac{\Pr(H_0|y_D)}{\Pr(H_1|y_D)} = \frac{\Pr(y_D|H_0)}{\Pr(y_D|H_1)} \times \frac{\Pr(H_0)}{\Pr(H_1)} \tag{8}$$

where $\Pr(H_0)$ and $\Pr(H_1)$ are the prior probabilities of H0 and H1 respectively, representing
the prior knowledge one has on the validity of these two hypotheses before collecting
experimental data; and $\Pr(H_0|y_D)$ and $\Pr(H_1|y_D)$ are the posterior probabilities of H0
and H1 respectively, representing the updated knowledge one has after analyzing the collected
experimental data. The likelihood function $\Pr(y_D|H_i)$ in Eq. 8 is the conditional
probability of observing the data $y_D$ given the hypothesis $H_i$ ($i = 0$ or 1), and the ratio
$\Pr(y_D|H_0)/\Pr(y_D|H_1)$ is known as the Bayes factor [15], [25] and is used as the validation
metric.

The original intent of the Bayes factor was to compare the data support for two models [26],
and thus the two hypotheses become H0: model $M_i$ is true and H1: model $M_j$ is true. If
$\theta_i$ and $\theta_j$ are the parameters of the two competing models respectively, the Bayes factor is
calculated as

$$B = \frac{\Pr(y_D|H_0)}{\Pr(y_D|H_1)} = \frac{\int \Pr(y_D|\theta_i)\,\pi(\theta_i)\,d\theta_i}{\int \Pr(y_D|\theta_j)\,\pi(\theta_j)\,d\theta_j} \tag{9}$$

where $\pi(\theta_i)$ and $\pi(\theta_j)$ are the probability density functions of $\theta_i$ and $\theta_j$ respectively.

In the context of validating a single model, H0 and H1 need to be formulated differently.
Rebba and Mahadevan [5], [7] proposed the equality-based formulation
($H_0: y_m = y_D$, $H_1: y_m \neq y_D$) and the interval-based formulation
($H_0: |y_m - y_D| < \epsilon$, $H_1: |y_m - y_D| > \epsilon$) for
Bayesian hypothesis testing, where $y_m$ is the model prediction for a particular input x.

Consider the case when both the model prediction Ym and the quantity to be predicted 
Y are random variables. Two null hypotheses can be formulated: (1) the hypothesis that the 
difference between the means of Ym and Y, and the difference between the standard deviations 
of Ym and Y, are within desired intervals respectively; (2) the hypothesis that the PDF of 
Ym is equal to the PDF of Y. With the first formulation, it is straightforward to derive the 
likelihood functions under the null and alternative hypothesis, and the existence of directional 
bias can be reflected in the test, as will be shown below. The advantages of the second 
formulation are that it avoids the setting of interval width in the first formulation, and leads
to a direct test on probability distributions instead of distribution parameters. For the case
that either Y or Ym is deterministic, the first formulation can still be applicable by setting the
standard deviation of the deterministic quantity to be zero; however, the second formulation
only applies to the case when both Y and Ym are stochastic. These two formulations are
applicable to both fully characterized and partially characterized experiments. Note that in
the case where the model output follows a heavy-tailed distribution, formulating hypotheses on
higher order moments (instead of the mean and standard deviation) may be necessary in 
order to assess the validity of the model. In this paper, the prediction distribution of the 
damping model in the application example is close to a Gaussian distribution. Therefore, we 
only consider hypotheses on the first two moments (mean and standard deviation) and the 
entire PDF for the purpose of illustration. 

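Interval hypothesis on distribution parameters. The interval hypothesis can be formulated
as $H_0: \epsilon_{\mu 1} \leq \mu_m - \mu \leq \epsilon_{\mu 2}$ and $\epsilon_{\sigma 1} \leq \sigma_m - \sigma \leq \epsilon_{\sigma 2}$, and $H_1: \mu_m - \mu > \epsilon_{\mu 2}$ or $\mu_m - \mu < \epsilon_{\mu 1}$,
or $\sigma_m - \sigma > \epsilon_{\sigma 2}$ or $\sigma_m - \sigma < \epsilon_{\sigma 1}$. Here $\mu_m$ and $\mu$ are the means of Ym and Y respectively,
$\sigma_m$ and $\sigma$ are the standard deviations of Ym and Y respectively, and $\epsilon_{\mu 1}$, $\epsilon_{\mu 2}$, $\epsilon_{\sigma 1}$ and $\epsilon_{\sigma 2}$ are
constants which define the width of the intervals. Note that $\epsilon_{\mu 1} < \epsilon_{\mu 2}$ and $\epsilon_{\sigma 1} < \epsilon_{\sigma 2}$.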

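Under the interval hypothesis H0, $\mu$ can be any value in $[\mu_m - \epsilon_{\mu 2}, \mu_m - \epsilon_{\mu 1}]$.
So $\mu \sim \mathrm{Unif}(\mu_m - \epsilon_{\mu 2}, \mu_m - \epsilon_{\mu 1})$, and the PDF $\pi_0(\mu|\mu_m) = 1/(\epsilon_{\mu 2} - \epsilon_{\mu 1})$. Similarly,
$\sigma \sim \mathrm{Unif}(\sigma_m - \epsilon_{\sigma 2}, \sigma_m - \epsilon_{\sigma 1})$, and the PDF $\pi_0(\sigma|\sigma_m) = 1/(\epsilon_{\sigma 2} - \epsilon_{\sigma 1})$. Thus

$$\begin{aligned}
\pi_0(y|\mu_m,\sigma_m) &= \iint \pi(y|\mu,\sigma)\,\pi_0(\mu|\mu_m)\,\pi_0(\sigma|\sigma_m)\,d\mu\,d\sigma \\
&= \frac{1}{(\epsilon_{\mu 2}-\epsilon_{\mu 1})(\epsilon_{\sigma 2}-\epsilon_{\sigma 1})} \int_{\sigma_m-\epsilon_{\sigma 2}}^{\sigma_m-\epsilon_{\sigma 1}} \left[ \int_{\mu_m-\epsilon_{\mu 2}}^{\mu_m-\epsilon_{\mu 1}} \pi(y|\mu,\sigma)\,d\mu \right] d\sigma
\end{aligned} \tag{10}$$

In the presence of measurement noise, the experimental observation is a random variable
with conditional probability $\Pr(y_D|y)$. Hence the likelihood function under the null hypothesis
H0 can be derived as

$$\Pr(y_D|H_0) = \int \Pr(y_D|y)\,\pi_0(y|\mu_m,\sigma_m)\,dy \tag{11}$$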



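Under the alternative hypothesis H1, $\mu$ can be any value outside $[\mu_m - \epsilon_{\mu 2}, \mu_m - \epsilon_{\mu 1}]$,
but the uniform distribution is not applicable to infinite space in practical cases. To avoid
this issue, we can assume that the possible values of $\mu$ are within a finite interval $[\mu_l, \mu_u]$
based on the underlying physics. Therefore $\mu \sim \mathrm{Unif}\big((\mu_l, \mu_m - \epsilon_{\mu 2}) \cup (\mu_m - \epsilon_{\mu 1}, \mu_u)\big)$, and the
PDF $\pi_1(\mu|\mu_m) = 1/(\mu_u - \mu_l + \epsilon_{\mu 1} - \epsilon_{\mu 2})$. Similarly, $\sigma \sim \mathrm{Unif}\big((\sigma_l, \sigma_m - \epsilon_{\sigma 2}) \cup (\sigma_m - \epsilon_{\sigma 1}, \sigma_u)\big)$,
and the PDF $\pi_1(\sigma|\sigma_m) = 1/(\sigma_u - \sigma_l + \epsilon_{\sigma 1} - \epsilon_{\sigma 2})$. Thus

$$\begin{aligned}
\pi_1(y|\mu_m,\sigma_m) &= \iint \pi(y|\mu,\sigma)\,\pi_1(\mu|\mu_m)\,\pi_1(\sigma|\sigma_m)\,d\mu\,d\sigma \\
&= \frac{A}{(\mu_u - \mu_l + \epsilon_{\mu 1} - \epsilon_{\mu 2})(\sigma_u - \sigma_l + \epsilon_{\sigma 1} - \epsilon_{\sigma 2})}
\end{aligned} \tag{12}$$

where A is calculated as

$$\begin{aligned}
A ={}& \int_{\sigma_l}^{\sigma_m-\epsilon_{\sigma 2}} \left[ \int_{\mu_l}^{\mu_m-\epsilon_{\mu 2}} \pi(y|\mu,\sigma)\,d\mu + \int_{\mu_m-\epsilon_{\mu 1}}^{\mu_u} \pi(y|\mu,\sigma)\,d\mu \right] d\sigma \\
&+ \int_{\sigma_m-\epsilon_{\sigma 1}}^{\sigma_u} \left[ \int_{\mu_l}^{\mu_m-\epsilon_{\mu 2}} \pi(y|\mu,\sigma)\,d\mu + \int_{\mu_m-\epsilon_{\mu 1}}^{\mu_u} \pi(y|\mu,\sigma)\,d\mu \right] d\sigma
\end{aligned} \tag{13}$$

The likelihood function under H1 can then be derived as

$$\Pr(y_D|H_1) = \int \Pr(y_D|y)\,\pi_1(y|\mu_m,\sigma_m)\,dy \tag{14}$$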

The Bayes factor for the Bayesian interval hypothesis testing can be calculated by dividing
$\Pr(y_D|H_0)$ in Eq. 11 by $\Pr(y_D|H_1)$ in Eq. 14.
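To make the computation in Eqs. 10 to 14 concrete, the following sketch evaluates the interval-hypothesis Bayes factor for a single data point by midpoint-rule integration. It rests on assumptions that are not stated in the formulation above: Y is taken to be Gaussian given $(\mu, \sigma)$ and the measurement noise is taken to be zero-mean Gaussian with standard deviation `sigma_d`, so that the integral over y has the closed form $N(y_D; \mu, \sigma^2 + \sigma_D^2)$. All function names, bounds and numerical values are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def midpoints(lo, hi, k):
    """Return k midpoints of equal-width cells on [lo, hi]."""
    h = (hi - lo) / k
    return [lo + (i + 0.5) * h for i in range(k)]


def mean_likelihood(y_d, mu_pieces, sigma_pieces, sigma_d, k=100):
    """Average of N(y_d; mu, sigma^2 + sigma_d^2) under a uniform prior on
    (union of mu_pieces) x (union of sigma_pieces), by the midpoint rule."""
    total, area = 0.0, 0.0
    for mu_lo, mu_hi in mu_pieces:
        for s_lo, s_hi in sigma_pieces:
            cell = (mu_hi - mu_lo) * (s_hi - s_lo) / (k * k)
            for mu in midpoints(mu_lo, mu_hi, k):
                for s in midpoints(s_lo, s_hi, k):
                    # Assumed Gaussian Y and Gaussian noise: variances add
                    total += normal_pdf(y_d, mu, math.hypot(s, sigma_d)) * cell
            area += (mu_hi - mu_lo) * (s_hi - s_lo)
    return total / area


def interval_bayes_factor(y_d, mu_m, sigma_m, sigma_d, eps_mu, eps_sigma,
                          mu_bounds, sigma_bounds):
    (e_mu1, e_mu2), (e_s1, e_s2) = eps_mu, eps_sigma
    (mu_l, mu_u), (s_l, s_u) = mu_bounds, sigma_bounds
    # H0 (Eqs. 10-11): mu and sigma inside their intervals
    like_h0 = mean_likelihood(y_d, [(mu_m - e_mu2, mu_m - e_mu1)],
                              [(sigma_m - e_s2, sigma_m - e_s1)], sigma_d)
    # H1 (Eqs. 12-14): mu and sigma in the two-piece complements
    like_h1 = mean_likelihood(y_d,
                              [(mu_l, mu_m - e_mu2), (mu_m - e_mu1, mu_u)],
                              [(s_l, sigma_m - e_s2), (sigma_m - e_s1, s_u)],
                              sigma_d)
    return like_h0 / like_h1


# Hypothetical settings: one observation near and one far from the prediction
settings = dict(mu_m=10.0, sigma_m=1.0, sigma_d=0.2, eps_mu=(-0.5, 0.5),
                eps_sigma=(-0.2, 0.2), mu_bounds=(5.0, 15.0), sigma_bounds=(0.1, 3.0))
b_close = interval_bayes_factor(10.1, **settings)
b_far = interval_bayes_factor(14.0, **settings)
print(f"B(y_D = 10.1) = {b_close:.2f}, B(y_D = 14.0) = {b_far:.4f}")
```

Because $\Pr(y_D|y)$ is only needed up to a constant that is common to numerator and denominator, using the noise PDF in place of the probability does not change the ratio. Under these assumed settings the data point near the predicted mean supports H0 (Bayes factor above one) and the distant one supports H1.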



It is straightforward to apply this method to the case that Ym is deterministic and the
case that Y is deterministic. For the first case, let $\sigma_m$ be zero and the rest of the computation
remains the same. For the second case, the interval assumption will only be made on $\mu$ and
$\mu_m$, since we know $\sigma$ is zero. The other steps will be the same as above.

The directional bias defined in Section 1 can be captured by conducting two separate
Bayesian interval hypothesis tests. In the first test, we set $\epsilon_{\mu 1} = -\epsilon_\mu$ and $\epsilon_{\mu 2} = 0$, and thus
under the null hypothesis $-\epsilon_\mu \leq \mu_m - \mu \leq 0$. In the second test, we set $\epsilon_{\mu 1} = 0$ and $\epsilon_{\mu 2} = \epsilon_\mu$,
and thus under the null hypothesis $0 \leq \mu_m - \mu \leq \epsilon_\mu$. The model will fail if either of these two
null hypotheses fails the corresponding test. Therefore, the existence of directional bias will
increase the chance of a model to fail the combined test. Fig. 2 illustrates this combined test
using the concept of data space. Suppose Z is the overall validation data space, Z1 is the set
of data which does not support the model in the first Bayesian interval hypothesis test, and
Z2 is the set of data which does not support the model in the second test. Then, the union
of Z1 and Z2 is the set of data that does not support the model combining these two tests.




Figure 2: Graphical illustration of the combined test 

Equality hypothesis on probability density functions. To further validate the entire distribution
of Ym predicted by a probabilistic model, H0 and H1 can be formulated correspondingly
as the predicted distribution $\pi(y_m)$ being or not being the true distribution of the quantity
to be predicted Y, i.e., $H_0: \pi(y_m) = \pi(y)$ and $H_1: \pi(y_m) \neq \pi(y)$. The Bayes factor in this
case becomes

$$B = \frac{\Pr(y_D|H_0)}{\Pr(y_D|H_1)} = \frac{\int \Pr(y_D|y)\,\pi_0(y)\,dy}{\int \Pr(y_D|y)\,\pi_1(y)\,dy} \tag{15}$$

where $\Pr(y_D|y)$ is the conditional probability of observing noisy data $y_D$ given that the
actual value of Y is y; $\pi_0(y)$ is the PDF of Y under the null hypothesis H0 and hence
$\pi_0(y) = \pi(y_m)$; $\pi_1(y)$ is the PDF of Y under the alternative hypothesis H1. If no extra
information about $\pi_1(y)$ is available, it can be assumed as a non-informative uniform PDF.
Note that the bounds of this uniform distribution will affect the value of the estimated Bayes
factor, and thus they should be carefully selected according to the available information.

Note that $\Pr(y_D \mid y)$ is proportional to the value of the PDF of $Y_D$ conditioned on $y$,
evaluated at $Y_D = y_D$, i.e., $\Pr(y_D \mid y) \propto \pi(y_D \mid y)$. Therefore, Eq. 15 can be rewritten as

$$B = \frac{\int \pi(y_D \mid y)\,\pi_0(y)\,dy}{\int \pi(y_D \mid y)\,\pi_1(y)\,dy} \tag{16}$$
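As a minimal sketch of Eq. 16, assume that the model prediction is Gaussian, the measurement noise is zero-mean Gaussian with standard deviation `sigma_d` > 0, and the alternative PDF is uniform on `[lower, upper]`. Under these assumptions both integrals have closed forms. The function name, argument names, and numerical values below are illustrative and do not come from the paper.

```python
from statistics import NormalDist

def bayes_factor_equality(y_d, mu_m, sigma_m, sigma_d, lower, upper):
    """Bayes factor of Eq. 16 for one data point (Gaussian assumptions)."""
    # Numerator: Gaussian prediction convolved with Gaussian noise, i.e. the
    # N(mu_m, sigma_m^2 + sigma_d^2) density evaluated at the observation y_d.
    numerator = NormalDist(mu_m, (sigma_m**2 + sigma_d**2) ** 0.5).pdf(y_d)
    # Denominator: uniform PDF 1/(upper - lower) on [lower, upper],
    # convolved with the same noise density.
    noise = NormalDist(y_d, sigma_d)
    denominator = (noise.cdf(upper) - noise.cdf(lower)) / (upper - lower)
    return numerator / denominator

# Hypothetical values: one observation at the predicted mean, one far from it.
b_near = bayes_factor_equality(0.30, 0.30, 0.02, 0.005, 0.0, 1.0)
b_far = bayes_factor_equality(0.45, 0.30, 0.02, 0.005, 0.0, 1.0)
```

With these numbers the first observation gives a Bayes factor above one and the second a Bayes factor below one, which is the qualitative behavior the equality hypothesis test relies on.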

Validation data from fully/partially characterized experiments. If the validation data are
from a fully characterized experiment, i.e., all the input parameters $x$ of the experiment
are measured and the point values of $x$ are available, $\mu_m$ and $\sigma_m$ used in the Bayesian
interval hypothesis testing are the mean and standard deviation of the model prediction
given the input $x$, and the PDF of $Y_m$ ($\pi(y_m)$) used in the Bayesian equality hypothesis
testing is conditioned on $x$. If the experiment is partially characterized, i.e., some of the inputs
$x$ corresponding to observation $y_D$ are not measured or are reported as intervals, we can
assume that $x$ has the PDF $\pi(x)$ based on the reported intervals or expert opinion. One can
first calculate the unconditional PDF of the model prediction $\pi(y_m)$ by propagating uncertainty
from $x$ to the model output $Y_m$

$$\pi(y_m) = \int \pi(y_m \mid x)\,\pi(x)\,dx \tag{17}$$

and then calculate $\mu_m$ and $\sigma_m$ from $\pi(y_m)$. If data from both fully characterized and partially
characterized experiments are available, we can first calculate the Bayes factors corresponding
to these two types of data points separately, using different $\mu_m$ and $\sigma_m$ (in the Bayesian
interval hypothesis testing) or $\pi(y_m)$ (in the Bayesian equality hypothesis testing) as shown
above, and then multiply these Bayes factors to obtain an overall Bayes factor, as discussed
below.
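The propagation in Eq. 17 can be approximated by sampling. The sketch below is one possible Monte Carlo implementation, not the authors' procedure: it assumes the conditional prediction has a known mean and standard deviation for each input, and it uses the law of total variance to obtain the unconditional $\mu_m$ and $\sigma_m$. The model (`mean_fn`, `std_fn`) and the input interval are hypothetical.

```python
import random
from statistics import fmean, pvariance

def unconditional_moments(mean_fn, std_fn, sample_x, n_samples=20000, seed=0):
    """Mean and std of pi(y_m) = integral of pi(y_m | x) pi(x) dx (Eq. 17)."""
    rng = random.Random(seed)
    xs = [sample_x(rng) for _ in range(n_samples)]
    means = [mean_fn(x) for x in xs]
    variances = [std_fn(x) ** 2 for x in xs]
    mu_m = fmean(means)
    # Law of total variance: E[Var(Y_m | x)] + Var(E[Y_m | x]).
    var_m = fmean(variances) + pvariance(means, mu_m)
    return mu_m, var_m ** 0.5

# Hypothetical model: conditional mean 2x, conditional std 0.1;
# the unmeasured input is reported only as the interval [0, 1].
mu_m, sigma_m = unconditional_moments(
    mean_fn=lambda x: 2.0 * x,
    std_fn=lambda x: 0.1,
    sample_x=lambda rng: rng.uniform(0.0, 1.0),
)
```

Treating the reported interval as a uniform distribution is itself an assumption; any other $\pi(x)$ can be supplied through `sample_x`.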

Bayesian hypothesis testing with multiple data points. If there are $N$ ($N > 1$) validation
data points available and the corresponding experiments are conducted independently,
i.e., no correlation exists between different data points, then according to the basic rule of
probability theory, the probability of observing the whole data set $\Pr(y_D)$ is the product
of the probabilities of observing the individual data points $\Pr(y_{D_i})$, $i = 1, 2, \ldots, N$. Since the
likelihood functions $\Pr(y_D \mid H_0)$ and $\Pr(y_D \mid H_1)$ are essentially probabilities of observing the
data, after computing the Bayes factor for each data point, these individual Bayes factors can
be multiplied to compute the overall Bayes factor under the assumption that the observations
are independent, as

$$B = \prod_{i=1}^{N} B_i \tag{18}$$

If the number of data points $N$ is relatively large and most of the $B_i$'s are larger than one,
the product of the individual Bayes factors may be a very large number. In such a case it is more
convenient to express the Bayes factor on a logarithmic scale as

$$\log B = \sum_{i=1}^{N} \log B_i \tag{19}$$
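The practical reason for working with Eq. 19 rather than Eq. 18 can be illustrated with a short sketch. The per-point Bayes factors below are hypothetical; the point is only that their direct product overflows double precision while the sum of logarithms stays finite.

```python
import math

def overall_log_bayes_factor(bayes_factors):
    """Sum of log B_i (Eq. 19); valid only for independent observations."""
    return sum(math.log(b) for b in bayes_factors)

individual = [50.0] * 200                 # hypothetical per-point Bayes factors
log_b = overall_log_bayes_factor(individual)
direct_product = math.prod(individual)    # Eq. 18 evaluated naively
```

Here `direct_product` evaluates to floating-point infinity, whereas `log_b` remains an ordinary number that can still be compared with a threshold such as $\log B_{th}$.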

Interpretation of Bayesian hypothesis testing results. If the calculated Bayes factor is greater
than 1, the data favor the null hypothesis; if the Bayes factor is less than
1, the data favor the alternative hypothesis. In addition, Jeffreys [TF]
gave a heuristic interpretation of the Bayes factor in terms of the level of support that the
hypotheses obtain from data. The threshold value of the Bayes factor $B_{th}$ can be related to the
so-called Bayes risk in detection theory [28, 29], which is the sum of costs due to different
decision scenarios: failing to reject the true/wrong hypothesis and rejecting the true/wrong
hypothesis. It has been shown that appropriate selection of $B_{th}$ can help reduce the Bayes
risk [28]. If one assumes that the cost of making correct decisions (failing to reject the true
hypothesis or rejecting the wrong hypothesis) is zero, the costs of type I and type II errors
are the same, and the prior probabilities of the null and alternative hypotheses being true
are equal, then the resulting $B_{th} = 1$ [29]. However, it should be noted that, as a part of
the decision making process, the choice of thresholds for the Bayes factor inevitably contains
subjective elements.

Before collecting validation data, there may be no evidence to support or reject the
model. In that case, it may be reasonable to assume that the prior probabilities of the null
hypothesis and alternative hypothesis are equal (= 0.5). A simple expression for
the posterior probability of the null hypothesis can then be derived in terms of the Bayes factor [5],
which is a convenient metric to assess the confidence in the model prediction:

$$\Pr(H_0 \mid y_D) = \frac{\Pr(y_D \mid H_0)\Pr(H_0)}{\Pr(y_D \mid H_0)\Pr(H_0) + \Pr(y_D \mid H_1)\Pr(H_1)} = \frac{\Pr(y_D \mid H_0)}{\Pr(y_D \mid H_0) + \Pr(y_D \mid H_1)} = \frac{B}{1 + B} \tag{20}$$
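The posterior probability of the null hypothesis can be evaluated directly from a log-scale Bayes factor. The sketch below assumes the standard relation "posterior odds = Bayes factor times prior odds", which reduces to $B/(1+B)$ for equal priors; the function name and the overflow-safe logistic form are implementation choices, not taken from the paper.

```python
import math

def posterior_null(log_b, prior_null=0.5):
    """Posterior Pr(H0 | y_D) from log B and the prior Pr(H0)."""
    log_odds = log_b + math.log(prior_null / (1.0 - prior_null))
    # Logistic function, written to avoid overflow for large |log_odds|.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

p_equal = posterior_null(math.log(3.0))   # B = 3 with equal priors
```

Working from `log_b` keeps the calculation usable when the overall Bayes factor from many data points is too large or too small to represent directly.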



An advantage of Bayesian hypothesis testing is that the posterior probabilities of $H_0$
and $H_1$ obtained from the validation exercise can both be used through a Bayesian model-averaging
approach [13, 30, 31] to reflect the effect of the model validation result on the
uncertainty in the model output, as shown in Eq. 21

$$\pi(y) = \pi_0(y)\Pr(H_0 \mid y_D) + \pi_1(y)\Pr(H_1 \mid y_D) \tag{21}$$

where $\pi(y)$ is the predicted PDF of $Y$ combining the PDFs of $Y$ under the null and alternative
hypotheses. Therefore, instead of completely accepting a single model, one can include the
risk of using this model in further calculations. This helps to avoid both Type I and Type II
errors, i.e., rejecting a correct model or accepting a wrong model.
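Eq. 21 is a two-component mixture, which a few lines of code make concrete. The sketch assumes a Gaussian PDF under the null hypothesis, a uniform PDF on [0, 1] under the alternative, and equal prior probabilities so that the posterior weight is $B/(1+B)$; all numerical values are hypothetical.

```python
from statistics import NormalDist

def averaged_pdf(y, pdf_null, pdf_alt, post_null):
    """Eq. 21: mix the two PDFs with the posterior probabilities."""
    return pdf_null(y) * post_null + pdf_alt(y) * (1.0 - post_null)

# Hypothetical numbers: Gaussian model prediction, uniform alternative on [0, 1].
pdf_null = NormalDist(0.30, 0.02).pdf
pdf_alt = lambda y: 1.0 if 0.0 <= y <= 1.0 else 0.0
bayes_factor = 3.0
post_null = bayes_factor / (1.0 + bayes_factor)   # equal priors
density = averaged_pdf(0.30, pdf_null, pdf_alt, post_null)
```

A model with weak support from the data keeps a small weight in the mixture instead of being accepted or discarded outright.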

3.3. Relationship between p-value and Bayes factor 

Although the p-value in classical hypothesis testing and the Bayes factor $B$ are based
on different philosophical assumptions and formulated differently, it has been shown that
these two metrics can be mathematically related for some special cases [32]. In the discussion
below, the Bayes factor based on the hypothesis on probability density functions for a fully
characterized experiment is found to be related to the p-value in the t-test and z-test, if the model
prediction $Y_m$ is a normal random variable with mean $\mu_m$ and standard deviation $\sigma_m$.

Starting from the formula of the Bayes factor in Eq. 16, since we assume that the PDF of the
quantity to be predicted $Y$ under the alternative hypothesis $H_1$ is uniform, the integration
term in the denominator is not affected by the target model and thus can be treated as a
constant $1/C$. Based on the null hypothesis $H_0$, the quantity to be predicted $Y \sim N(\mu_m, \sigma_m^2)$.
Recalling the relationship $Y_D = Y + \epsilon_D$ with $\epsilon_D \sim N(0, \sigma_D^2)$, we know that $Y_D \sim N(\mu_m, \sigma_m^2 + \sigma_D^2)$.
Thus the numerator of Eq. 16 can be calculated as

$$\int \pi(y_D \mid y)\,\pi_0(y \mid x)\,dy = \frac{1}{\sqrt{\sigma_m^2 + \sigma_D^2}}\;\phi\!\left(\frac{y_D - \mu_m}{\sqrt{\sigma_m^2 + \sigma_D^2}}\right) \tag{22}$$

where $\phi$ is the PDF of the standard normal random variable.

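If the variance of measurement noise is negligible compared to the variance of $Y_m$, i.e.,
$\sigma_D^2 \ll \sigma_m^2$, we have $\sigma_m^2 + \sigma_D^2 \approx \sigma_m^2$. Also note that for a single data point $\bar{Y}_D = y_D$. Therefore
Eq. 16 becomes

$$B = \frac{C}{\sigma_m}\,\phi\!\left(\frac{\bar{Y}_D - \mu_m}{\sigma_m}\right) \tag{23}$$

Based on Eqs. 1 and 3 we have

$$\bar{Y}_D - \mu_m = \begin{cases} t\,S_D/\sqrt{n}, & \text{for } t\text{-test} \\ z\,\sigma_{Y_D}/\sqrt{n}, & \text{for } z\text{-test} \end{cases} \tag{24}$$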



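Substituting Eq. 24 into Eq. 23, we obtain

$$B = \begin{cases} \dfrac{C}{\sigma_m}\,\phi\!\left[\dfrac{t\,S_D}{\sigma_m\sqrt{n}}\right], & \text{for } t\text{-test} \\[2ex] \dfrac{C}{\sigma_m}\,\phi\!\left[\dfrac{z\,\sigma_{Y_D}}{\sigma_m\sqrt{n}}\right], & \text{for } z\text{-test} \end{cases} \tag{25}$$

where $\phi$ is the probability density function of a standard normal variable.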

From Eq. 25 we can see that the Bayes factor can be related to either the z statistic or
the t statistic, and hence it can be related to the p-value in both the z-test and the t-test. Combining
Eqs. 4 and 25, we obtain the relation between the Bayes factor and the p-value in the z-test as

$$B = \frac{C}{\sigma_m}\,\phi\!\left[\frac{\Phi^{-1}(p/2)\,\sigma_{Y_D}}{\sigma_m\sqrt{n}}\right] \tag{26}$$

where $\Phi^{-1}$ is the inverse standard normal CDF. Similarly, the relation between the Bayes factor
and the p-value in the t-test can be obtained by combining Eqs. 2 and 25 as

$$B = \frac{C}{\sigma_m}\,\phi\!\left[\frac{F_{T,n-1}^{-1}(p/2)\,S_D}{\sigma_m\sqrt{n}}\right] \tag{27}$$

where $F_{T,n-1}^{-1}$ is the inverse CDF of a t-distribution with $(n - 1)$ degrees of freedom.

If the chosen significance level in the z-test or t-test is $\alpha$, the corresponding threshold Bayes
factor $B_{th}$ can be calculated using Eq. 26 or 27 by letting $p = \alpha$. In that case, the z-test/t-test
with significance level $\alpha$ and Bayesian hypothesis testing with the corresponding threshold
value $B_{th}$ will both give the same model validation result.
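A threshold Bayes factor equivalent to a z-test at level $\alpha$ can be computed from the relation between $B$ and the p-value in the z-test, i.e., $B = (C/\sigma_m)\,\phi[\Phi^{-1}(p/2)\,\sigma_{Y_D}/(\sigma_m\sqrt{n})]$ with $p = \alpha$. The sketch below is an illustration under that reading of the formula; the function name and the input values ($C = 1$, $\sigma_m = \sigma_{Y_D} = 0.02$, $n = 1$) are hypothetical.

```python
from statistics import NormalDist

def bayes_factor_threshold_z(alpha, c, sigma_m, sigma_yd, n=1):
    """Threshold Bayes factor matching a z-test at significance level alpha."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(alpha / 2.0)   # negative quantile; phi is even
    return c / sigma_m * std_normal.pdf(z_alpha * sigma_yd / (sigma_m * n ** 0.5))

b_th = bayes_factor_threshold_z(alpha=0.05, c=1.0, sigma_m=0.02, sigma_yd=0.02)
```

Because the threshold scales with $C/\sigma_m$, it depends on the bounds of the uniform alternative and on the spread of the model prediction, not on $\alpha$ alone.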



4. Non-hypothesis testing-based methods 

Besides the binary hypothesis testing methods discussed above, various other validation
metrics have been developed to quantify the agreement between models and experimental
data from other perspectives, such as the Mahalanobis distance for models with multivariate
output [21], the weighted validation data-based metric [8], the Kullback-Leibler divergence
in the area of signal processing [33] and for the design of validation experiments [21], the
probability bound-based metric [35], the confidence interval-based metric [2], the reliability-based
metric [7], and the area metric [10, 11]. The weighted validation data-based metric
introduced by Hills and Leslie [8] is designed for the case when the importance of different
validation experiments is different. The confidence interval-based validation method proposed
by Oberkampf et al. [9] computes the confidence interval of error, which is defined as the
difference between the model mean prediction and the true mean of the variable to be
predicted. An average absolute error metric and an average absolute confidence indicator
are also computed. However, it is not clear how to apply this method to validation of
a multivariate model with data from discrete test combinations, as the method requires
integration over the space of test inputs. Therefore, only the reliability-based metric and the
area metric are discussed in this paper.

4.1. Reliability-based metric

The reliability metric $r$ proposed by Rebba and Mahadevan [7] is a direct measure of
model prediction quality and is relatively easy to compute. It is defined as the probability of
the difference $d$ between model prediction and observed data being less than a given quantity
$\epsilon$

$$r = \Pr(-\epsilon < d < \epsilon), \qquad d = Y_D - Y_m \tag{28}$$

As mentioned in Section 2, the experimental observation is random due to measurement
uncertainty, and the model output may be uncertain in the Bayesian framework. Therefore, as
the difference between two random variables, $d$ is interpreted in the Bayesian framework as
a random variable. The probability distribution of $d$ can be obtained from the probability
distributions of $Y_D$ and $Y_m$. For instance, if the model prediction $Y_m \sim N(\mu_m, \sigma_m^2)$ and the
corresponding observation $Y_D \sim N(\mu, \sigma^2 + \sigma_D^2)$ (see discussion in Section 3.1), the difference
$d \sim N(\mu - \mu_m, \sigma^2 + \sigma_D^2 + \sigma_m^2)$. For the sake of simplicity, let $\sigma_d = \sqrt{\sigma^2 + \sigma_D^2 + \sigma_m^2}$. In this
case, the reliability-based metric $r$ can be derived as

$$r = \Phi\!\left[\frac{\epsilon - (\mu - \mu_m)}{\sigma_d}\right] - \Phi\!\left[\frac{-\epsilon - (\mu - \mu_m)}{\sigma_d}\right] \tag{29}$$
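For Gaussian $d$, the reliability metric is the difference of two normal CDF values and can be evaluated in a few lines. The sketch below assumes $d \sim N(\mu - \mu_m, \sigma_d^2)$ with $\sigma_d > 0$; the function name and the test values are illustrative only.

```python
from statistics import NormalDist

def reliability_metric(mu, mu_m, sigma_d, eps):
    """Pr(-eps < d < eps) for d ~ N(mu - mu_m, sigma_d^2)."""
    d = NormalDist(mu - mu_m, sigma_d)
    return d.cdf(eps) - d.cdf(-eps)

# Hypothetical values: an unbiased and a biased prediction, same spread.
r_unbiased = reliability_metric(mu=0.30, mu_m=0.30, sigma_d=0.02, eps=0.025)
r_biased = reliability_metric(mu=0.30, mu_m=0.33, sigma_d=0.02, eps=0.025)
```

The biased case yields a smaller value, since less of the probability mass of $d$ falls inside $[-\epsilon, \epsilon]$.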

In this paper, experimental data are considered as samples of the random variable $Y_D$.
Therefore, if the number of experimental data is relatively large, e.g., $n > 30$, the sample
variance $S_D^2$ can be assumed to be a good estimator of $\sigma_{Y_D}^2$ (the true variance of $Y_D$), which
is needed to compute the reliability metric. If $n$ is small and no prior information on $\sigma$ is
available, we can assume that $\sigma = \sigma_m$, which is the same assumption used in the z-test. By
assuming further that the mean of the validation data $\bar{Y}_D$ is equal to $\mu$, Eq. 29 can be rewritten
as

$$r = \Phi\!\left[\frac{\epsilon - (\bar{Y}_D - \mu_m)}{\sigma_d}\right] - \Phi\!\left[\frac{-\epsilon - (\bar{Y}_D - \mu_m)}{\sigma_d}\right] \tag{30}$$



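By substituting Eq. 24 into Eq. 30, the relation between the reliability-based metric $r$
and the test statistic in the t-test or z-test is obtained as

$$r = \begin{cases} \Phi\!\left[\dfrac{\epsilon - t\,S_D/\sqrt{n}}{\sigma_d}\right] + \Phi\!\left[\dfrac{\epsilon + t\,S_D/\sqrt{n}}{\sigma_d}\right] - 1, & \text{for } t\text{-test} \\[2ex] \Phi\!\left[\dfrac{\epsilon - z\,\sigma_{Y_D}/\sqrt{n}}{\sigma_d}\right] + \Phi\!\left[\dfrac{\epsilon + z\,\sigma_{Y_D}/\sqrt{n}}{\sigma_d}\right] - 1, & \text{for } z\text{-test} \end{cases} \tag{31}$$

By combining Eqs. 2, 4, and 31, the reliability-based metric can be further related to the
p-value in the t-test or z-test as

$$r = \begin{cases} \Phi\!\left[\dfrac{\epsilon - F_{T,n-1}^{-1}(p/2)\,S_D/\sqrt{n}}{\sigma_d}\right] + \Phi\!\left[\dfrac{\epsilon + F_{T,n-1}^{-1}(p/2)\,S_D/\sqrt{n}}{\sigma_d}\right] - 1, & \text{for } t\text{-test} \\[2ex] \Phi\!\left[\dfrac{\epsilon - \Phi^{-1}(p/2)\,\sigma_{Y_D}/\sqrt{n}}{\sigma_d}\right] + \Phi\!\left[\dfrac{\epsilon + \Phi^{-1}(p/2)\,\sigma_{Y_D}/\sqrt{n}}{\sigma_d}\right] - 1, & \text{for } z\text{-test} \end{cases} \tag{32}$$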



If one chooses to test models based on a threshold reliability value $r_{th}$ calculated by
letting $p = \alpha$ in Eq. 32 above, the result of model validation will be the same as that in the
t-test or z-test with significance level $\alpha$.

Note that the threshold rth used in the reliability-based method represents the minimum 
probability of the difference d falling within an interval [— e, e], and the decision of accept- 
ing/rejecting a model can be made based on the decision maker's acceptable level of model 
reliability. 
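The threshold $r_{th}$ that reproduces a z-test at level $\alpha$ follows from the z-test relation $r = \Phi[(\epsilon - q)/\sigma_d] + \Phi[(\epsilon + q)/\sigma_d] - 1$ with $q = \Phi^{-1}(p/2)\,\sigma_{Y_D}/\sqrt{n}$ and $p = \alpha$. The sketch below implements that expression as read here; the function name and the numerical inputs are hypothetical.

```python
from statistics import NormalDist

def reliability_threshold_z(alpha, eps, sigma_yd, sigma_d, n=1):
    """Threshold reliability matching a z-test at significance level alpha."""
    std_normal = NormalDist()
    shift = std_normal.inv_cdf(alpha / 2.0) * sigma_yd / n ** 0.5
    return (std_normal.cdf((eps - shift) / sigma_d)
            + std_normal.cdf((eps + shift) / sigma_d) - 1.0)

# Hypothetical inputs for illustration.
r_th = reliability_threshold_z(alpha=0.05, eps=0.025, sigma_yd=0.02, sigma_d=0.03)
```

The threshold grows with the tolerance $\epsilon$ and shrinks as $\sigma_d$ grows, so it has to be recomputed whenever either quantity changes.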

Since the reliability-based metric is the probability of $d$ being within a given interval, it
can also reflect the existence of directional bias by modifying the intervals. Similar to the
Bayesian interval hypothesis testing, we can take two different intervals $[0, \epsilon]$ and $[-\epsilon, 0]$,
and calculate the corresponding values of the metric, $r_1$ and $r_2$, as:

$$r_1 = \Phi\!\left[\frac{\epsilon - (\mu - \mu_m)}{\sigma_d}\right] - \Phi\!\left[\frac{-(\mu - \mu_m)}{\sigma_d}\right], \qquad r_2 = \Phi\!\left[\frac{-(\mu - \mu_m)}{\sigma_d}\right] - \Phi\!\left[\frac{-\epsilon - (\mu - \mu_m)}{\sigma_d}\right] \tag{33}$$

The values of $r_1$ and $r_2$ are compared with the threshold $r_{th}/2$ (half of the original
threshold value because the width of the intervals considered is half of the original one), and the
model fails if either $r_1$ or $r_2$ is less than $r_{th}/2$.
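The directional version of the metric splits the interval $[-\epsilon, \epsilon]$ at zero. The sketch below assumes Gaussian $d \sim N(\mu - \mu_m, \sigma_d^2)$ and labels the probabilities for $[0, \epsilon]$ and $[-\epsilon, 0]$ as `r1` and `r2`; the names and the values are illustrative.

```python
from statistics import NormalDist

def directional_reliability(mu, mu_m, sigma_d, eps):
    """r1 = Pr(0 < d < eps) and r2 = Pr(-eps < d < 0) for Gaussian d."""
    d = NormalDist(mu - mu_m, sigma_d)
    r1 = d.cdf(eps) - d.cdf(0.0)
    r2 = d.cdf(0.0) - d.cdf(-eps)
    return r1, r2

def fails_directional_test(r1, r2, r_th):
    """The model fails if either half-interval metric is below r_th / 2."""
    return min(r1, r2) < r_th / 2.0

# Hypothetical case: the model over-predicts, so d tends to be negative.
r1, r2 = directional_reliability(mu=0.30, mu_m=0.32, sigma_d=0.02, eps=0.025)
```

With an over-predicting model, `r1` becomes much smaller than `r2`, so the half-interval test can fail even when the sum `r1 + r2` would pass the two-sided threshold.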

Note that when the quantity to be predicted $Y$ is deterministic, $\sigma$ becomes
zero, and when the model prediction $Y_m$ is deterministic, $\sigma_m$ becomes zero.

4.2. Area metric-based method

The area metric proposed by Ferson et al. [10, 11] is attractive due to its capability
to incorporate fully characterized experiments using the so-called "u-pooling" procedure,
and thus to validate models with sparse data on multiple experimental combinations [12].
For a single experimental combination with input $x_i$, suppose $F_{x_i}$ is the corresponding
cumulative distribution function (CDF) of the model output $Y_m$ and $y_{D_i}$ is the observed data;
one can then compute a u-value for this experimental combination as

$$u_i = F_{x_i}(y_{D_i}) \tag{34}$$

Based on the probability integral transform theory in statistics [36], if the observed data
$y_{D_i}$ is a random sample from the probability distribution of the model output, the computed
$u_i$ will be a random sample from the standard uniform distribution, and thus the empirical
CDF of all the $u_i$'s ($i = 1, 2, \ldots, N$) should match the CDF of the standard uniform random
variable. The difference between these two CDF curves is a measure of the disparity between
model predictions and experimental observations. Hence, a model validation metric can be
developed as [10]

$$d(F_u, S_u) = \int_0^1 |F_u - S_u|\,du \tag{35}$$

where $F_u$ is the empirical CDF of all the $u_i$'s and $S_u$ is the CDF of the standard uniform
random variable. If the value of $d(F_u, S_u)$ is small/large, the model predictions are
correspondingly close/not close to experimental observations.
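The u-pooling procedure and the area between the two CDF curves can be sketched as follows. The code assumes Gaussian model predictions for the u-values and integrates the absolute difference between the step-shaped empirical CDF and the line $S_u(u) = u$ exactly, piece by piece. Function names and data are hypothetical, and this is one possible implementation rather than the one used in the paper.

```python
from statistics import NormalDist

def u_values(observations, means, stds):
    """u_i = F_i(y_Di), assuming each prediction is N(mean_i, std_i^2)."""
    return [NormalDist(m, s).cdf(y) for y, m, s in zip(observations, means, stds)]

def area_metric(u):
    """Exact area between the empirical CDF of u and the U(0, 1) CDF."""
    n = len(u)
    points = [0.0] + sorted(u) + [1.0]
    area = 0.0
    for k in range(n + 1):
        a, b = points[k], points[k + 1]
        c = k / n                      # empirical CDF value on [a, b)
        if a <= c <= b:                # the line S_u = u crosses the step here
            area += ((c - a) ** 2 + (b - c) ** 2) / 2.0
        else:
            area += abs(c * (b - a) - (b * b - a * a) / 2.0)
    return area

# Hypothetical data: every observation lies below the predicted mean.
u = u_values([0.27, 0.28, 0.26, 0.29], [0.30] * 4, [0.02] * 4)
biased_area = area_metric(u)
spread_area = area_metric([0.125, 0.375, 0.625, 0.875])
```

In this example all u-values fall below 0.5, so `biased_area` is considerably larger than `spread_area`, the value obtained for evenly spread u-values.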

If the model prediction $Y_m$ is deterministic, the CDF $F_{x_i}$ in Eq. 34 cannot
be constructed, and hence the area metric-based method is not applicable to testing a
computational model with deterministic output.

The area metric can also reflect the existence of directional bias, i.e., the experimental
observations being consistently below or above the corresponding mean predictions of the numerical
model. This is due to the use of the CDF of the model output in Eq. 34. For example, if the model
outputs at different test combinations are Gaussian random variables, the $F_{x_i}$'s in Eq. 34
become Gaussian CDFs. Hence, the values of $F_{x_i}(y_{D_i})$ will all be less than 0.5 if the $y_{D_i}$'s are
smaller than the means of the corresponding Gaussian variables. Therefore, instead of being
uniformly distributed on [0, 1], the $u_i$'s are distributed on [0, 0.5], causing a large area
between the empirical CDF of $u_i$ and the standard uniform CDF.

Compared to the hypothesis testing methods and the reliability-based method, the area
metric-based method lacks a direct interpretation of the model acceptance threshold, i.e., it is
not clear how to set an appropriate threshold for deciding when to reject/accept a
model. This is a disadvantage of the area metric-based approach.

5. Numerical example 

In this section, the aforementioned model validation methods are illustrated via an
application example on damping prediction for MEMS switches. The quantity to be predicted,
the damping coefficient, is considered as a random variable due to the lack of understanding
in physical modeling; in other words, the epistemic uncertainty of the damping coefficient is
represented by a subjective probability distribution, following the Bayesian way of thinking.
The corresponding computational model is also stochastic, as will be shown in Section 5.1.1.
The validation data are obtained from fully characterized experiments, and it is found that
the directional bias defined in Section 1 exists between model prediction and validation data.

5.1. Damping model and experimental data 

Despite the superior performance provided in terms of signal loss and isolation compared
with silicon devices [31], the use of RF MEMS switches in applications requiring high
reliability is hindered by significant variations in device lifetime [38]. Rigorous quantification
of the uncertainty sources contributing to the observed life variations is necessary in order to
achieve the design of reliable devices. Within the framework of uncertainty quantification in
the modeling of RF MEMS switches, the validation of the squeeze-film damping model emerges
as a crucial issue due to two factors: (1) damping strongly affects the dynamic behavior
of the MEMS switch and therefore its lifetime [39]; (2) it is difficult to accurately model
micro-scale fluid damping, and available models are applicable to limited regimes [40].

5.1.1. Uncertainty quantification in micro-scale squeeze-film damping prediction 

For the purpose of illustration, this study considers damping prediction using the Navier-Stokes
slip jump model [H]. Two major sources of uncertainty have been shown to affect
the prediction of gas damping jSH]. The first one is epistemic uncertainty related to the lack
of understanding of fundamental failure modes and related physical models. The second
one is aleatory uncertainty in model parameters and inputs due to variability in either the
fabrication process or the operating environment. Uncertainty quantification approaches
usually require large numbers of deterministic numerical simulations. In order to reduce the
computational cost, a generalized polynomial chaos (gPC) surrogate model [16] is constructed
and trained using solutions of the Navier-Stokes (N-S) equation for a few input points, thus
avoiding repeatedly solving the N-S equation. Note that several other surrogate modeling
techniques are also available, including Kriging or Gaussian Process (GP) interpolation [42],
support vector machine (SVM) [3^, relevance vector machine [H], etc. The gPC model is
used for the purpose of illustration. This model approximates the target stochastic function
using orthogonal polynomials in terms of the random inputs [EB]. A $P$-th order gPC model
$y_m(x)$ that approximates a random function $y(x)$ can be written as



$$y_m(x) = \sum_{i=1}^{M} a_i\,\phi_i(x) + \epsilon_m \tag{36}$$

where the $\phi_i$'s are orthonormal polynomial bases such as Legendre polynomials, Hermite
polynomials, and Wiener-Askey polynomials; the number of terms $M$ is determined by the dimension
of the input $x$ and the order $P$ of the polynomial; $\epsilon_m$ is the error of the gPC model; and the
$a_i$'s are coefficients that can be obtained as

$$a_i = \frac{\int y_m(x)\,\phi_i(x)\,dx}{\int \phi_i^2(x)\,dx} \approx \frac{1}{h_i}\sum_{j=1}^{N} w_j\,y(x_j)\,\phi_i(x_j) \tag{37}$$

where $h_i = \int \phi_i^2(x)\,dx$ is constant for a given polynomial basis $\phi_i(x)$, and $\{x_j, w_j\}_{j=1}^{N}$ is a
set of nodes and weights of the quadrature rule for numerical integration.
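The quadrature-based projection of the coefficients, $a_i = (1/h_i)\sum_j w_j\,y(x_j)\,\phi_i(x_j)$, can be illustrated in one dimension. The sketch below uses a 3-point Gauss-Legendre rule on $[-1, 1]$ and the first three (unnormalized) Legendre polynomials, for which $h_i = 2/(2i + 1)$ with $i$ counted from zero. The "expensive model" is a hypothetical quadratic chosen so that the exact coefficients are known; none of this reproduces the three-input, third-order model used in the paper.

```python
import math

# 3-point Gauss-Legendre rule on [-1, 1] (exact for polynomials up to degree 5).
NODES = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
WEIGHTS = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

# First three Legendre polynomials, indexed from zero.
LEGENDRE = [
    lambda x: 1.0,
    lambda x: x,
    lambda x: 0.5 * (3.0 * x * x - 1.0),
]

def gpc_coefficients(y):
    """a_i = (1/h_i) * sum_j w_j y(x_j) phi_i(x_j) for each basis function."""
    coeffs = []
    for i, phi in enumerate(LEGENDRE):
        h_i = 2.0 / (2 * i + 1)        # integral of phi_i^2 over [-1, 1]
        total = sum(w * y(x) * phi(x) for x, w in zip(NODES, WEIGHTS))
        coeffs.append(total / h_i)
    return coeffs

def expensive_model(x):
    # Hypothetical stand-in; equals 2*phi_0 + 2*phi_1 + 2*phi_2.
    return 1.0 + 2.0 * x + 3.0 * x * x

coeffs = gpc_coefficients(expensive_model)
```

Only three model evaluations are needed here, which mirrors the motivation for the surrogate: the expensive solver is run at the quadrature nodes and nowhere else.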

Based on the damping coefficient values $y(x_j)$ calculated at the quadrature nodes $x_j$ by
solving the Navier-Stokes slip jump model, the gPC model $y_m(x)$ can be constructed using
Eqs. 36 and 37. For any given input $x_k$, $\mu_m(x_k) = \sum_{i=1}^{M} a_i\,\phi_i(x_k)$ is deterministic, while the
residual term $\epsilon_m$ is random. Under the Gauss-Markov assumption, $\epsilon_m$ asymptotically follows
a Gaussian distribution with zero mean, and the variance can be estimated as [15, 16]

$$\sigma_m^2 = s^2\left[1 + \phi^T(x_k)\,(\Phi^T\Phi)^{-1}\,\phi(x_k)\right] \tag{38}$$

where $\sigma_m$ is a function of the model input $x_k$; the vector $\phi(x_k) = [\phi_1(x_k), \phi_2(x_k), \ldots, \phi_M(x_k)]^T$;
the matrix $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)]^T$; and $s^2 = \frac{1}{N - M}\sum_{j=1}^{N}[\mu_m(x_j) - y(x_j)]^2$.

Therefore, for a given input $x_k$, the prediction of the damping coefficient based on the gPC
model is a random variable with Gaussian distribution $N(\mu_m(x_k), \sigma_m^2(x_k))$. The methods
presented in Sections 3 and 4 will be applied to the validation of this predicted distribution.

The example RF MEMS switch modeled as a membrane is shown in Fig. 3. To construct
a gPC model for the damping coefficient, the input parameters $x$ need to be specified first.
A probabilistic sensitivity analysis shows that the membrane thickness $t$, the gap height $g$,
and the frequency $\omega$ are the major sources of variability in the damping coefficient, and
hence these three parameters are included in the gPC model, i.e., $x = [t, g, \omega]$. Four different
gas pressures (18798.45 Pa, 28664.31 Pa, 43596.41 Pa, and 66661.19 Pa) are considered,
and correspondingly four gPC models are constructed. This example uses a third-order gPC
model with Legendre polynomial bases.

It should be noted that the validity of the surrogate model does not guarantee the validity 
of the original model. We only have access to the surrogate model and validation experimental 
data; therefore in this example we are only assessing the validity of the surrogate model.

[Figure 3 labels: Ni membrane (~1-3 μm thick); Ti layer (~250 Å)]

Figure 3: Example RF MEMS switch (Courtesy: Purdue PRISM center)

5.1.2. Experimental data for validation 

In the experiment, seven devices with different geometric dimensions are considered. For
each of the four pressures mentioned above, five repeated tests are conducted on each of the
seven devices, and hence 140 data points are obtained. Since the input parameters $[t, g, \omega]$
are recorded for each of the data points, these experiments are fully characterized. However,
there is only one data point for each test combination, because each of the 140
input value combinations is different from the others.



Fig. 4(a) shows a graphical comparison between the mean gPC model prediction and
experimental data under the four different pressures by aggregating predictions and data
with respect to the 35 test combinations for each pressure value. The top/bottom points are
correspondingly the maximum/minimum values of model mean predictions and experimental
data, and the square/diamond markers are the average values of predictions/data on the 35
test combinations. A more detailed graphical comparison showing the mean prediction of the
gPC model vs. experimental data on each of the individual test combinations is provided in
Figs. 4(b).



From the graphical comparison, we can see that the gPC model performs better under
the middle two values of pressure. Also note that there is a systematic bias between the
gPC model and experimental observations at the low pressure value (18798.45 Pa), i.e., the
mean predictions of the gPC model are always larger than the experimental data.

Figure 4: Graphical comparisons between gPC predictions and experimental data

5.2. Validation based on binary hypothesis testing

5.2.1. Classical hypothesis testing

Because the sample size for each experimental combination is only 1, the t-test is not
applicable and instead the z-test is used. The p-values calculated using Eq. 4 are shown in Fig. 5.
The dashed lines in Fig. 5 represent the significance level $\alpha = 0.05$. The model is considered
to have failed at the experimental combinations where the corresponding p-values fall below
the dashed line. Note that a more rigorous test will need to include the probability of making
a type II error. The individual numbers of failures of the four gPC models are shown in
Table 2.

Figure 5: p-value of z-test

Table 2: Performance of gPC models in z-test with $\alpha$ = 0.05

Pressure (Pa)         18798.45   28664.31   43596.41   66661.19
Number of failures    10         5          7          20
Failure percentage    28.6%      14.3%      20.0%      57.1%
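The failure counts in Table 2 come from applying the z-test separately at each test combination. Eq. 4 is not reproduced in this section, so the sketch below assumes the standard two-sided p-value $p = 2\Phi(-|z|)$ with known standard deviation, which is consistent with the $p/2$ appearing in Eq. 26. The function names and the three data points are hypothetical and are not the experimental data.

```python
from statistics import NormalDist

def z_test_p_value(y_bar, mu_m, sigma, n=1):
    """Two-sided p-value for H0: the data mean equals mu_m (known sigma)."""
    z = (y_bar - mu_m) / (sigma / n ** 0.5)
    return 2.0 * NormalDist().cdf(-abs(z))

def failure_percentage(observations, means, stds, alpha=0.05):
    """Count test combinations whose p-value falls below alpha."""
    failures = sum(
        z_test_p_value(y, m, s) < alpha
        for y, m, s in zip(observations, means, stds)
    )
    return failures, 100.0 * failures / len(observations)

# Hypothetical example with one observation per test combination (n = 1).
failures, percent = failure_percentage(
    observations=[0.30, 0.36, 0.25],
    means=[0.30, 0.30, 0.30],
    stds=[0.02, 0.02, 0.02],
)
```

Applied to the 35 test combinations at each pressure, a count of 10 failures corresponds to the 28.6% reported in the first column of Table 2.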



5.2.2. Bayesian hypothesis testing 



Interval hypothesis on distribution parameters. As discussed in Section 3.2, the combination of
two Bayesian hypothesis tests based on the two interval null hypotheses
can reflect the existence of directional bias. In practice, the parameters $\epsilon_\mu$, $\epsilon_{\sigma 1}$, and $\epsilon_{\sigma 2}$ that
define the intervals can be determined based on the strictness requirement of validation. For
the purpose of illustration, we set $\epsilon_\mu = 0.025$, $\epsilon_{\sigma 1} = -0.005$, and $\epsilon_{\sigma 2} = 0.005$. $\mu_l$ and $\mu_u$,
which define the possible range of $\mu$, are set to 0 and 1 respectively since the MEMS device
considered is under-damped; $\sigma_l$ and $\sigma_u$ are set to 0.001 and 0.05 respectively. The results
of the Bayesian interval hypothesis tests are calculated using Eqs. 10-14, and are shown in
Fig. 6 and Table 3.

Table 3: Performance of gPC models in interval-based Bayesian hypothesis testing with log $B_{th}$ =

Pressure (Pa)                 18798.45   28664.31   43596.41   66661.19

First test: $-\epsilon_\mu < \mu_m - \mu < 0$, $\epsilon_{\sigma 1} < \sigma_m - \sigma < \epsilon_{\sigma 2}$
  Number of failures          10         5                     10
  Overall Bayes factor        3.1        58.3       92.9       44.1

Second test: $0 < \mu_m - \mu < \epsilon_\mu$, $\epsilon_{\sigma 1} < \sigma_m - \sigma < \epsilon_{\sigma 2}$
  Number of failures          5          4          5          14
  Overall Bayes factor        63.9       87.1       74.1       1.4

Combined test
  Number of failures          10         5          5          16
  Failure percentage          28.6%      14.3%      14.3%      45.7%

Equality hypothesis on probability density functions. In this study, the possible values of the
damping coefficient range from 0 to 1 since the system is under-damped. Hence the limits
of integration in the denominator of Eq. 16 are [0, 1], while the limits of integration in the
numerator are $(-\infty, +\infty)$.

Figure 6: Bayes factor in interval-based hypothesis testing (on logarithmic scale)

The performance of the gPC models in Bayesian hypothesis testing is shown in Fig. 7
and Table 4. The values of the Bayes factor are calculated using Eq. 16, and the threshold
Bayes factor $B_{th} = 1$ (this threshold value is chosen based on the discussion in Section 3.2).
Although the performance of the gPC model in terms of failure percentage is different for the
two hypothesis tests as shown in Table 2 and Table 4, if one increases the threshold Bayes
factor $B_{th}$ to 2.88, which is calculated using Eq. 26 with $p = 0.05$ in Section 3.3, the result of
Bayesian hypothesis testing in terms of the number of failures becomes the same as in the
z-test in Section 5.2.1. The reason for this coincidence has been explained in Section 3.3.
Note that the performance of the second gPC model (for pressure = 28664.31 Pa) remains
the same when $B_{th}$ is raised from 1 to 2.88, and this can be easily observed from Fig. 7(b).

Figure 7: Bayes factor in equality-based hypothesis testing (on logarithmic scale)

Table 4: Performance of gPC models in equality-based hypothesis testing with log $B_{th}$ = 0

Pressure (Pa)                      18798.45   28664.31   43596.41   66661.19
Number of failures                 5          5          3          15
Failure percentage                 14.3%      14.3%      8.6%       42.9%
Overall Bayes factor (log-scale)   7.4        57.2       72.3       -10.2

By comparing the results based on the interval hypothesis on distribution parameters and the
equality hypothesis on probability density functions (Tables 3 and 4), it can be observed that
the performance of the gPC model for pressure 18798.45 Pa in the first test is significantly
worse than in the second test, while the models for the other three pressures have similar
failure percentages in these two tests. As shown in Fig. 4(b), the data are all located below
the mean predictions of this gPC model, which indicates the existence of directional bias,
and thus the gPC model for pressure 18798.45 Pa performs worse in the Bayesian interval
hypothesis testing.

5.3. Validation using non-hypothesis testing-based methods 
5.3.1. Reliability-based metric 

Fig. 8 and Table 5 show the calculated values of the reliability-based metrics $r$, $r_1$, and $r_2$
(Eqs. 29 and 33), and the failure percentage of the models with $\epsilon = 0.025$ and the decision criterion
$r_{th} = 0.2325$. This decision criterion is obtained using Eq. 32 with the significance level
$\alpha = 0.05$, and thus the results of validation (comparing $r$ with $r_{th}$) in terms of failure
percentage are the same as in the z-test in Section 5.2.1. It can also be observed that the failure
percentage of the gPC model for pressure 18798.45 Pa increases significantly in the test that
compares $r_1$ and $r_2$ with $r_{th}/2$, due to the existence of directional bias.

Table 5: Performance of gPC models in reliability-based method with $r_{th}$ = 0.69

Pressure (Pa)                                      18798.45   28664.31   43596.41   66661.19

$r$ vs. $r_{th}$
  Number of failures                               10         5          7          20
  Failure percentage                               28.6%      14.3%      20.0%      57.1%

$r_1$ and $r_2$ vs. $r_{th}/2$
  Number of failures                               20         7          12         25
  Failure percentage                               57.1%      20.0%      34.3%      71.4%



5.3.2. Area metric-based method 



The area metric for the four gPC models can be computed using Eqs. 34 and 35, and
the results are shown in Fig. 9 and Table 6. Note that the gPC model for pressure 18798.45
Pa has the highest area value and thus performs worst. This is due to the directional bias
between mean predictions and experimental data, and it is reflected in the area metric as
discussed in Section 4.2.

5.4. Discussion 

This section demonstrated a numerical example of validating the gPC surrogate model 

for the RF switch damping coefficient using the validation methods presented in Sections [3] 

33 



1.5 



Ci3 



^ 0.5 

o 



18798.45 Pa 



-rth = 0.2325 



XXXXXxxXxX 



++++++++++xxxxx X 

+++++x¥?+x 
OoooOooOoO + 

ooooo^jooOo 



***** 



X* xx 

+++++ 
o 



10 20 30 

Test combination number 



(a) 



1.5 



^ 0.5 



28664.31 Pa ° 

+ ^2 

---rt,, = 0.2325 

X XXX X xxx 

^X xX X xXXX^xxXXx ''xxX 
oo*oooOj,o + + ++++++++x + + ^+oOooo 



■f"'" + ooooqooOOo Oo° 



10 20 30 

Test combination number 



(b) 



1.5 



03 



^ 0.5 

o 



43596.41 Pa 



---rt,, = 0.2325 

X^XXXXx X X xX 



""Xx 

°°°o°j^4*r^^:x; 



-JT T - + 
kkH + ++ 



+++++ 



'^xx 

0+ + 



10 20 30 

Test combination number 



(c) 



1.5 



<A 



^0.5 



66661.19 Pa 



xxxxxx xXX 



-rth = 0.2325 - 



"4? 





K » ^ o 

¥»«+*+«+»+o|ic«8 



10 20 30 

Test combination number 



(d) 



Figure 8: Reliability-based metric 



Table 6: Area metric for gPC models 



Pressure (Pa) 


18798.45 


28664.31 


43596.41 


66661.19 


d{Fu, Su) 


0.543 


0.146 


0.151 


0.250 



and|4} and 140 fully characterized experimental data points. Based on the performance of the 
gPC model in these validation tests, it can be concluded that the prediction of the gPC model 
has better agreement with observation under the middle two values of pressure (28664.31 Pa 
and 43596.41 Pa), whereas less agreement can be found under the lowest and highest pressure 
values (18798.45 Pa and 66661.19 Pa). The decision on model acceptance can be formed 
based on the failure percentages from the hypothesis testing methods and the reliability-based
method, and on the values of the area-based metric, given the desired acceptance threshold. The
z-test and the reliability-based metric give the same failure percentage when r_th is selected
based on the significance level α used in the z-test. Similarly, classical and Bayesian
hypothesis testing give the same result when a specific threshold Bayes factor is chosen, as
shown in Section 3. The existence of directional bias is reflected in Bayesian interval
hypothesis testing, in the reliability-based method with modified intervals, and in the area
metric-based method. Models that have directional bias perform worse under these three
validation methods than under classical hypothesis testing and under Bayesian hypothesis
testing with an equality hypothesis on probability density functions.

[Figure 9: Empirical CDF of u_i and standard uniform CDF. Panels (a) to (d) correspond to 18798.45 Pa, 28664.31 Pa, 43596.41 Pa, and 66661.19 Pa. Horizontal axis: u. Curves: CDF of u_i and uniform CDF.]
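
The area metric values in Table 6 measure the gap between the two curves in each panel of Figure 9. A minimal sketch of that computation is given below, assuming that each u_i is the model-predicted CDF evaluated at the corresponding observation and that the metric is the area between the empirical CDF of the u_i and the standard uniform CDF on [0, 1]. The function names and the sample values are hypothetical.

```python
import numpy as np


def _abs_linear_integral(a, b, c):
    """Exact integral of |c - x| over [a, b]."""
    if b <= a:
        return 0.0
    if c <= a:
        return 0.5 * ((b - c) ** 2 - (a - c) ** 2)
    if c >= b:
        return 0.5 * ((c - a) ** 2 - (c - b) ** 2)
    return 0.5 * ((c - a) ** 2 + (b - c) ** 2)


def area_metric_u(u):
    """Area between the empirical CDF of u and the uniform CDF on [0, 1]."""
    u = np.sort(np.clip(np.asarray(u, dtype=float), 0.0, 1.0))
    n = u.size
    # The empirical CDF is a step function that equals k/n between
    # consecutive sorted values, so the integral is done piece by piece.
    knots = np.concatenate(([0.0], u, [1.0]))
    area = 0.0
    for k in range(n + 1):
        area += _abs_linear_integral(knots[k], knots[k + 1], k / n)
    return area


# Hypothetical usage: u values clustered near 0 suggest a one-sided
# discrepancy, and they produce a larger area than evenly spread values.
d_biased = area_metric_u([0.02, 0.05, 0.10, 0.15, 0.20])
d_spread = area_metric_u([0.1, 0.3, 0.5, 0.7, 0.9])
```

With this definition the metric lies between 0 and 0.5, and it reaches 0.5 only when all u_i sit at 0 or all sit at 1.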
6. Conclusion

This paper explored various quantitative validation methods, including classical hypothesis 
testing, Bayesian hypothesis testing, a reliability-based method, and an area metric-based 
method, in order to validate computational model prediction. The numerical example 
featured a generalized polynomial chaos (gPC) surrogate model which predicts the micro- 
scale squeeze-film damping coefficient for RF MEMS switches. 

A Bayesian interval hypothesis testing-based method is formulated, which validates the
accuracy of the predicted mean and standard deviation from a model, taking into account the
existence of directional bias. Further, Bayesian hypothesis testing to validate the entire PDF
of model prediction is formulated. These two formulations of Bayesian hypothesis testing can
be applied to both fully characterized and partially characterized experiments, and to the case
when multiple validation points are available. It is shown that while classical hypothesis
testing is subject to Type I and Type II errors, Bayesian hypothesis testing can minimize
such risk by (1) selecting a risk-based threshold, and (2) subsequent model averaging using
posterior probabilities. It is observed that under some conditions, the p-value in the z-test
or t-test can be mathematically related to the Bayes factor and the reliability-based metric.
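
The two risk-reduction steps named above can be sketched with standard Bayesian decision theory. The sketch assumes a Bayes factor B = P(D | H0) / P(D | H1), a prior probability prior_h0 for the null hypothesis, zero cost for correct decisions, and user-chosen costs for Type I and Type II errors. The function names and numbers are hypothetical, and the threshold formula is the textbook minimum-expected-loss rule rather than a transcription of the paper's own equation.

```python
def posterior_h0(bayes_factor, prior_h0=0.5):
    """Posterior probability of H0 from the Bayes factor and prior."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    post_odds = bayes_factor * prior_odds
    return post_odds / (1.0 + post_odds)


def risk_based_threshold(cost_type1, cost_type2, prior_h0=0.5):
    """Bayes factor above which accepting H0 has the lower expected loss.

    cost_type1: cost of rejecting a correct model (Type I error).
    cost_type2: cost of accepting a wrong model (Type II error).
    """
    return (1.0 - prior_h0) * cost_type2 / (prior_h0 * cost_type1)


def averaged_prediction(pred_h0, pred_h1, bayes_factor, prior_h0=0.5):
    """Weight two competing predictions by their posterior probabilities."""
    p0 = posterior_h0(bayes_factor, prior_h0)
    return p0 * pred_h0 + (1.0 - p0) * pred_h1


# Hypothetical usage: a Bayes factor of 3 with equal priors, and a
# Type II error taken to be twice as costly as a Type I error.
b_demo = 3.0
accept_demo = b_demo > risk_based_threshold(cost_type1=1.0, cost_type2=2.0)
blend_demo = averaged_prediction(10.0, 20.0, b_demo)
```

Raising the cost of accepting a wrong model raises the threshold, so the model must be supported by stronger evidence before it is accepted. Averaging avoids the hard accept or reject decision by carrying both hypotheses forward with their posterior weights.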

The area metric is also sensitive to the direction of bias between model predictions and
experimental data, and so is the reliability-based method. The Bayesian model validation
result and the reliability-based metric can be directly incorporated in long-term failure and
reliability analysis of the device, thus explicitly accounting for model uncertainty, whereas
the area-based metric lacks a direct interpretation for its results. In addition, because
Bayesian hypothesis testing is built on the likelihood function, the Bayesian model validation
method can be extended to the case in which the validation data are in the form of intervals, as
shown in [m HE].
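
One common way to build a likelihood for interval data is to integrate the predicted density over the reported interval. The sketch below shows this for a single observation reported as [lower, upper], assuming a normal predictive distribution. The normal form, the function names, and the numbers are assumptions made for illustration, and the cited reference may use a different construction.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution N(mean, std^2), evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_likelihood(lower, upper, mean, std):
    """Likelihood of an interval observation under N(mean, std^2).

    The density is integrated over [lower, upper], which equals the
    difference of the CDF at the two end points.
    """
    return normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)


# Hypothetical usage: the same interval under two competing predictions.
like_close = interval_likelihood(0.9, 1.1, mean=1.0, std=0.2)
like_far = interval_likelihood(0.9, 1.1, mean=1.5, std=0.2)
ratio_demo = like_close / like_far
```

Under these assumptions, the ratio of two such likelihoods plays the same role as the Bayes factor computed from point-valued data.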